The main approach is to predict gene expression from the mean methylation level of the corresponding promoter region. Our approach is to fita methylation profile for each promoter region and then predict gene expression from the shape of methylation profiles. This can be thought as extracting more features from DNA methylation data and using them to predict transcription abundance. However, this approach is computationally more expensive since for each promoter we need to maximize the following quantity:
\[ p(\mathbf{y}_{i}|\mathbf{f}_{i}) = \prod_{l=1}^{L} p(y_{il}|f_{il}) = \prod_{l=1}^{L} \binom{t_{il}}{m_{il}} \Phi(f_{il})^{m_{il}} (1 - \Phi(f_{il})\big)^{t_{il} - m_{il}} \]
The f are basis functions which are squashed through the probit transformation in order to lie in the [0, 1] interval. Currently, rbf basis functions are implemented.
After learning the methylation profiles for each region, we use a regression model to predict gene expression. The default is to use the SVM regression model, but other approaches are implemented such as Random Forests and Multivariate adaptive regression splines.
We model the methylation profiles using 9 RBF kernels (i.e. basis functions). The learned parameters for these basis functions will be our extracted features that we will use to predict gene expression levels. A region of 10kb is taken for the promoter regions, which are centred around the TSS, so 5kb upstream and 5kb downstream of TSS. The chromosomes chrLambda and chrM where discarded from further analysis. Reads with less than 6 coverage were also discarded and promoter regions should contains at least 12 CpGs so as to be considered for further analysis. The gene expression data are in FPKM. We log2 transform the FPKM values to reduce variation and to avoid the \(log2(0)\) issue, \(\alpha = 0.1\) was added to all the counts. Finally, we remove genes that have expression levels above 600, considering them as noise.
# Plot some learned methylation profiles for DF Young mouse
t = 1569
gene_name <- df_young_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t,
X = df_young_X,
fit_prof = df_young_out_prof,
fit_mean = df_young_out_mean,
title = paste0("Gene ", df_young_genes$gene_name[t]))
# Plot some learned methylation profiles for DF Young mouse
t = 1533
gene_name <- df_young_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t,
X = df_young_X,
fit_prof = df_young_out_prof,
fit_mean = df_young_out_mean,
title = paste0("Gene ", df_young_genes$gene_name[t]))
t = 2133
gene_name <- df_young_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t,
X = df_young_X,
fit_prof = df_young_out_prof,
fit_mean = df_young_out_mean,
title = paste0("Gene ", df_young_genes$gene_name[t]))
# Methylation profiles for the Rem1 gene
t = 11428
gene_name <- df_young_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t,
X = df_young_X,
fit_prof = df_young_out_prof,
fit_mean = df_young_out_mean,
title = paste0("Gene ", df_young_genes$gene_name[t]))
t = 11433
gene_name <- df_old_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t,
X = df_old_X,
fit_prof = df_old_out_prof,
fit_mean = df_old_out_mean,
title = paste0("Gene ", df_old_genes$gene_name[t]))
t = 11455
gene_name <- N_young_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t,
X = N_young_X,
fit_prof = N_young_out_prof,
fit_mean = N_young_out_mean,
title = paste0("Gene ", N_young_genes$gene_name[t]))
t = 11474
gene_name <- N_old_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t,
X = N_old_X,
fit_prof = N_old_out_prof,
fit_mean = N_old_out_mean,
title = paste0("Gene ", N_old_genes$gene_name[t]))
As we can observe the latent basis functions can capture the rich patterns present in the WGBS data. We argue that there is a functional role for the shape of methylation profiles.
Using the learned profile parameters as input features, we train an SVM regression model using 70% of the data as training data, and the rest 30% as test data. So the training data consist of 10 input features w and the corresponding gene expression level y for each gene promoter region. Below we can see how well we can predict gene expression levels from DNA methylation patterns in the test data. The x-axis corresponds to the measured gene expression and the y-axis to the predicted gene expression levels. R denotes the Pearson’s correlation coefficient.
Our next step is to ask if methylation patterns are global across different cell lines and different mouse models. That is, learning the methylation profiles (i.e. extract higher order methylation features) for one mouse model, e.g. DF Old, can we use them to predict gene expression levels for a different mouse model, e.g. N Old?
The process is the following:
If there are global proerties of methylation patterns, then we expect similar correlation values to what we found on the previous section.
Confusion matrix of predictions from one mouse model to the other mouse models when using the methylation profiles as input features in the SVM regression model.
Confusion matrix of predictions from one mouse model to the other mouse models when using the methylation profiles as input features in the SVM regression model.